perf: profile baseline -- redirect from OP/blitter to GPU/DSP RISC by github-actions[bot] · Pull Request #128 · libretro/virtualjaguar-libretro

github-actions · 2026-05-01T23:52:02Z

Summary

Set up the OP perf branch per the plan, then profiled before writing any code. The profile data redirects the entire perf phase: OP isn't the bottleneck (1% on Iron Soldier, doesn't crack top-15 elsewhere); blitter isn't either except on the opt-in accurate path; GPU and DSP RISC interpretation dominate every ROM tested.

This PR is just the doc + the Makefile fix to make commercial ROM paths work in make benchmark. The headline finding lives in docs/op-perf-profile.md.

Profile (TL;DR)

ROM	Hot subsystem	% of frame
yarc.j64 (demo)	`GPUExec`	~74%
Iron Soldier (commercial)	`DSPExec` + opcodes	~67%
Skyhammer (fast blitter)	`GPUExec` + `DSPExec`	~57% combined
Skyhammer (accurate blitter)	`BlitterMidsummer2` + DSP + GPU	21% / 22% / 11%
OP across all ROMs	not in top-15, ~1% on Iron Soldier	n/a
68K across all ROMs	`m68k_execute` 0.7-2.6%	n/a

What this changes

Issue [spike] Object Processor (OP) performance audit + optimization plan #123 (OP perf) — close as wontfix-for-now. The cheap wins (O(N²) discovery scan, hoist transparency check) are still valid as code-quality fixes but won't move the headline number.
Issue [spike] Blitter performance audit (fast + accurate) + SIMD widening plan #124 (blitter) — narrow scope. Fast-blitter SIMD widening was the original ask, but fast blitter is <2% almost everywhere. The accurate-blitter perf is real (21% on Skyhammer) but only matters to users who deliberately enable that mode. Keep open as low-priority "someone-might-want-this".
Issue [spike] JIT / dynarec / cached IR for Jaguar CPUs (68000, GPU, DSP) #122 (JIT/dynarec/cached IR) — re-scope to GPU/DSP RISC only. Both share the Tom RISC ISA so a single dispatcher attacks both. 68K is consistently <3% so 68K JIT (and the GPLv2 license dance with UAE-JIT) isn't worth it.

Methodology

make benchmark BENCH_ROM=<rom> BENCH_BLITTER={fast,accurate} — 600 frames + 60 warmup, headless, no presentation
/usr/bin/sample <pid> 8 -file <out> over a 6000-frame benchmark run
6 ROMs: yarc, jagniccc, Iron Soldier 1/2, Doom, Skyhammer
Apple Silicon, default release flags + RELEASE_DEBUG_INFO=1

Test plan

All 6 ROMs benchmark without errors after the BENCH_ROM quoting fix
Sample profiles capture meaningful symbols (built with -g, no inlining lost the obvious targets)
Findings reproducible — same ROM, same blitter mode → same hot function within ±5%
CI green

Branch base

Base is develop (per GitFlow)

Captured baseline FPS + sample-based hot-function breakdown across 6 ROMs (yarc, jagniccc, Iron Soldier 1/2, Doom, Skyhammer) on Apple Silicon. Result contradicts the assumptions in spikes #123 (OP) and #124 (blitter): yarc.j64 -> GPUExec ~74% of frame time Iron Soldier -> DSPExec ~67% of frame time Skyhammer fast -> GPU+DSP ~57% of frame time, blitter <2% Skyhammer acc -> blitter ~21%, but only on opt-in accurate OP -> 1% on Iron Soldier, doesn't crack top-15 elsewhere 68K -> 0.7-2.6% everywhere; JIT not worth the licensing The right next perf target is GPU/DSP RISC dynarec or cached IR (half of issue #122). Both share the same Tom RISC ISA, so a single dispatcher attacks both. docs/op-perf-profile.md captures the methodology, the ROM-by-ROM numbers, and the recommendation. Drive-by: quote $(BENCH_ROM) in the Makefile benchmark recipe so ROM paths with spaces / parens (every commercial ROM filename) work. Co-Authored-By: Claude Opus 4.7 <[email protected]>

Original profile only sampled headless boot/menu state, which doesn't exercise the blitter much. Added --load-state to test_benchmark (handles RetroArch RASTATE container by extracting the MEM chunk), loaded the user's AvP state6 save (active gameplay), and re-profiled. Result: the spike reports' hypothesis about the blitter was right -- it just didn't show up in boot profiles. AvP gameplay, fast blitter: DSP ~63% (same as before), blitter 5%, GPU 4%, OP 1% AvP gameplay, accurate blitter: BlitterMidsummer2+ADDARRAY+DATA ~34% (matches docs/spike #124), DSP ~31%, GPU small OP stays at ~1% even on gameplay -- closing #123 as wontfix-for-now is still the honest call. Updated recommendation: two genuine targets, suggested order: 1. Accurate-blitter SIMD widening (#124) -- smaller surface, lower risk, immediate visible win for AvP-style accurate-blitter slowdown. ADDARRAY (src/tom/blitter.c:2090-2094) is the obvious first target. 2. RISC dynarec / cached IR (#122 RISC half) -- bigger lift, helps every game on every blitter mode. Co-Authored-By: Claude Opus 4.7 <[email protected]>

github-actions · 2026-05-02T00:06:55Z

Regression: `macos-arm64`

Regression Test Results

ROM	Status	Details	Diff
jagniccc	✅ PASS	0 pixels differ	-
yarc	✅ PASS	0 pixels differ	-
jagniccc (determinism)	✅ PASS	identical across runs	-
yarc (determinism)	✅ PASS	identical across runs	-
jagniccc (frameskip)	✅ PASS	skip=0 matches skip=3	-
yarc (frameskip)	✅ PASS	skip=0 matches skip=3	-
jagniccc (save state)	✅ PASS	round-trip matches	-
yarc (save state)	✅ PASS	round-trip matches	-
jagniccc (rewind)	✅ PASS	rewind matches	-
yarc (rewind)	✅ PASS	rewind matches	-

Platform: Darwin arm64

_{Updated by CI at 2026-05-02T02:39:26.395Z}

github-actions · 2026-05-02T00:37:10Z

Regression: `linux-x64`

Regression Test Results

ROM	Status	Details	Diff
jagniccc	✅ PASS	0 pixels differ	-
yarc	✅ PASS	0 pixels differ	-
jagniccc (determinism)	✅ PASS	identical across runs	-
yarc (determinism)	✅ PASS	identical across runs	-
jagniccc (frameskip)	✅ PASS	skip=0 matches skip=3	-
yarc (frameskip)	✅ PASS	skip=0 matches skip=3	-
jagniccc (save state)	✅ PASS	round-trip matches	-
yarc (save state)	✅ PASS	round-trip matches	-
jagniccc (rewind)	✅ PASS	rewind matches	-
yarc (rewind)	✅ PASS	rewind matches	-

Platform: Linux x86_64

_{Updated by CI at 2026-05-02T02:39:45.599Z}

github-actions · 2026-05-02T00:49:43Z

Regression: `linux-arm64`

Regression Test Results

ROM	Status	Details	Diff
jagniccc	✅ PASS	0 pixels differ	-
yarc	✅ PASS	0 pixels differ	-
jagniccc (determinism)	✅ PASS	identical across runs	-
yarc (determinism)	✅ PASS	identical across runs	-
jagniccc (frameskip)	✅ PASS	skip=0 matches skip=3	-
yarc (frameskip)	✅ PASS	skip=0 matches skip=3	-
jagniccc (save state)	✅ PASS	round-trip matches	-
yarc (save state)	✅ PASS	round-trip matches	-
jagniccc (rewind)	✅ PASS	rewind matches	-
yarc (rewind)	✅ PASS	rewind matches	-

Platform: Linux aarch64

_{Updated by CI at 2026-05-02T02:39:56.368Z}

Copilot

Pull request overview

Establishes a performance baseline for Virtual Jaguar’s OP/blitter vs GPU/DSP execution, and updates the benchmark harness so commercial ROM paths work reliably (and benchmarks can load a save state to profile gameplay rather than boot/menu idle).

Changes:

Add a new profiling write-up documenting measured hotspots and updated optimization priorities (docs/op-perf-profile.md).
Extend the headless benchmark tool with --load-state support (raw retro_serialize() payload or RetroArch RASTATE “MEM” chunk).
Quote $(BENCH_ROM) in make benchmark so paths with spaces work.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File	Description
`test/tools/test_benchmark.c`	Adds `--load-state` and RASTATE MEM-chunk extraction to allow profiling from in-game states.
`docs/op-perf-profile.md`	New documentation capturing the baseline/profile results and recommended perf focus.
`Makefile`	Quotes `$(BENCH_ROM)` in the benchmark target invocation to support ROM paths with spaces.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+         uint8_t *st_buf;
+         const uint8_t *payload;
+         size_t payload_size;
+         size_t expected;
+         fseek(stf, 0, SEEK_END);
+         st_total = ftell(stf);
+         fseek(stf, 0, SEEK_SET);
+         st_buf = (uint8_t *)malloc(st_total);
+         if (fread(st_buf, 1, st_total, stf) != (size_t)st_total)
+         {
+            fprintf(stderr, "ERROR: short read on state file\n");
+            free(st_buf); fclose(stf); return 1;
+         }
+         fclose(stf);
+         payload = st_buf;
+         payload_size = (size_t)st_total;


+            while (p + 8 <= end)
+            {
+               uint32_t chunk_size = (uint32_t)p[4] | ((uint32_t)p[5] << 8)
+                                  | ((uint32_t)p[6] << 16) | ((uint32_t)p[7] << 24);
+               if (memcmp(p, "MEM ", 4) == 0)
+               {
+                  payload = p + 8;
+                  payload_size = chunk_size;
+                  found = 1;
+                  break;
+               }
+               p += 8 + chunk_size;
+            }


+      FILE *stf = fopen(state_load_path, "rb");
+      if (!stf)
+      {
+         fprintf(stderr, "ERROR: cannot open state file: %s\n", state_load_path);
+         return 1;
+      }


Three concerns flagged by automated review on PR #128: * Validate fseek/ftell/malloc/fread when reading the state file (negative size, NULL malloc, short read) instead of trusting them. * Bounds-check RASTATE chunk_size against the buffer end before treating any bytes past the chunk header as payload, so a truncated/corrupt container can't push retro_unserialize past the allocation. * Route every state-load failure through a single cleanup label that closes the FILE, frees the buffer, and tears down retro_load_game / retro_init / dlopen before returning, matching the cleanup in the normal exit path. No functional change for valid state files. Smoke-tested with make benchmark (no --load-state) -- still reports FPS as before. Co-Authored-By: Claude Opus 4.7 <[email protected]>

JoeMatt added performance 🚀 📖 documentation labels May 1, 2026

JoeMatt changed the title ~~Update from feature/op-perf-baseline~~ perf: profile baseline -- redirect from OP/blitter to GPU/DSP RISC May 1, 2026

JoeMatt self-assigned this May 2, 2026

JoeMatt marked this pull request as ready for review May 2, 2026 00:46

JoeMatt self-requested a review as a code owner May 2, 2026 00:46

Copilot AI review requested due to automatic review settings May 2, 2026 00:46

Copilot started reviewing on behalf of JoeMatt May 2, 2026 00:47 View session

Copilot AI reviewed May 2, 2026

View reviewed changes

JoeMatt merged commit 662b13f into develop May 2, 2026
30 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf: profile baseline -- redirect from OP/blitter to GPU/DSP RISC#128

perf: profile baseline -- redirect from OP/blitter to GPU/DSP RISC#128
JoeMatt merged 3 commits intodevelopfrom
feature/op-perf-baseline

github-actions Bot commented May 1, 2026 •

edited by JoeMatt

Loading

Uh oh!

github-actions Bot commented May 2, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented May 2, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented May 2, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

github-actions Bot commented May 1, 2026 • edited by JoeMatt Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Profile (TL;DR)

What this changes

Methodology

Test plan

Branch base

Uh oh!

github-actions Bot commented May 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Regression: macos-arm64

Regression Test Results

Uh oh!

github-actions Bot commented May 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Regression: linux-x64

Regression Test Results

Uh oh!

github-actions Bot commented May 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Regression: linux-arm64

Regression Test Results

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

github-actions Bot commented May 1, 2026 •

edited by JoeMatt

Loading

github-actions Bot commented May 2, 2026 •

edited

Loading

Regression: `macos-arm64`

github-actions Bot commented May 2, 2026 •

edited

Loading

Regression: `linux-x64`

github-actions Bot commented May 2, 2026 •

edited

Loading

Regression: `linux-arm64`